{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overview\n",
    "\n",
    "这篇博客内容将包括对 XML 文件的解析、追加新元素后写入到 XML，以及更新原 XML 文件中某结点的值。使用的是 python 的`xml.dom.minidom`包，详情可见其官方文档：[xml.dom.minidom 官方文档](https://docs.python.org/2/library/xml.dom.minidom.html)。\n",
    "\n",
    "- 官方关于其他几种xml解析包的说明 [XML Processing Modules](https://docs.python.org/3.8/library/xml.html#xml-vulnerabilities)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 解析 XML 文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 导包 \n",
    "\n",
    "parse()和parseString()函数所做的是将一个XML解析器与一个DOM构建器连接起来，这个DOM构建器可以接受来自任何SAX解析器的解析事件，并将它们转换成一个DOM树。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from xml.dom.minidom import parse, parseString"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 解析\n",
    "\n",
    "`parse`和`parseString` 方法都会返回一个表示文档的内容的`Document`对象。\n",
    "\n",
    "### 解析文件\n",
    "\n",
    "传入文件名，或者类似于文件的对象都可。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方式一：\n",
    "dom1 = parse('c:\\\\temp\\\\mydata.xml')\n",
    "\n",
    "# 方式二（推荐）： 可自动释放资源\n",
    "with parse('c:\\\\temp\\\\mydata.xml') as dom1:\n",
    "    pass\n",
    "\n",
    "# 方式三：\n",
    "datasource = open('c:\\\\temp\\\\mydata.xml')\n",
    "dom2 = parse(datasource)  # parse an open file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 解析字符串\n",
    "\n",
    "> CDATA：在 XML 中，不会被解析器解析的部分数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [],
   "source": [
    "message = '''\n",
    "<!-- This is list of customers -->\n",
    "<customers>\n",
    "  <customer ID=\"C001\">\n",
    "    <name>Acme Inc.</name>\n",
    "    <phone>12345</phone>\n",
    "    <comments>\n",
    "      <![CDATA[Regular customer since 1995]]>\n",
    "    </comments>\n",
    "  </customer>\n",
    "  <customer ID=\"C002\">\n",
    "    <name>Star Wars Inc.</name>\n",
    "    <phone>23456</phone>\n",
    "    <comments>\n",
    "      <![CDATA[A small but healthy company.]]>\n",
    "    </comments>\n",
    "  </customer>\n",
    "</customers>\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'xml.dom.minidom.Document'>\n"
     ]
    }
   ],
   "source": [
    "# 方式一：\n",
    "dom3 = parseString(message)\n",
    "\n",
    "# 方式二：\n",
    "# 将字符串转为 `io.String`对象\n",
    "import io\n",
    "\n",
    "msg = io.StringIO(message)\n",
    "dom4 = parse(msg)\n",
    "print(type(dom4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 方法、属性\n",
    "\n",
    "一旦有了DOM文档对象，就可以通过其属性和方法访问XML文档的各个部分。这些属性在DOM规范中定义。文档对象的主要属性是`documentElement`属性。它提供了XML文档中的**根元素**。\n",
    "\n",
    "**在解析 XML 时，所有的文本都是储存在文本节点中的，且该文本节点被视为元素结点的子结点**，例如：2005，元素节点 ，拥有一个值为 “2005” 的文本节点，“2005” 不是 元素的值，最常用的方法就是 getElementsByTagName() 方法了，获取到结点后再进一步根据文档结构解析即可。\n",
    "\n",
    "具体的理论就不过多描述，配合上述 XML 文件和下面的代码，你将清楚的看到操作方法，下面的代码执行的工作是将所有的结点名称以及结点信息输出一下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<DOM Element: customers at 0x2cc4551d930>"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 文档根元素\n",
    "rootNode = dom4.documentElement\n",
    "rootNode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<DOM Text node \"'\\n  '\">,\n",
       " <DOM Element: customer at 0x2cc456ab800>,\n",
       " <DOM Text node \"'\\n  '\">,\n",
       " <DOM Element: customer at 0x2cc456aba60>,\n",
       " <DOM Text node \"'\\n'\">]"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 获取所有子节点\n",
    "rootNode.childNodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 136,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 获取第一个子节点\n",
    "first_node = rootNode.firstChild\n",
    "# 判断节点类型\n",
    "first_node.nodeType == first_node.TEXT_NODE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 139,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 兄弟节点\n",
    "# 第一个节点的兄弟节点即父节点的第二个子节点\n",
    "first_node.nextSibling == rootNode.childNodes[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<DOM Element: customer at 0x2cc45f2aa60>,\n",
       " <DOM Element: customer at 0x2cc45f2acc0>]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 根据节点名称查找，返回列表\n",
    "rootNode.getElementsByTagName('customer')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<DOM Element: phone at 0x2cc45f2ab90>, <DOM Element: phone at 0x2cc45f2adf0>]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 可以查找任意层级的节点\n",
    "rootNode.getElementsByTagName('phone')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<DOM Element: customer at 0x2cc45f2aa60>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过索引获取一个节点\n",
    "rootNode.getElementsByTagName('customer')[1]\n",
    "\n",
    "rootNode.childNodes[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tagName： customer\n",
      "nodeName： customer\n",
      "ID: C001\n",
      "Text data :  Acme Inc.\n",
      "Text value:  Acme Inc.\n",
      "Text xml  : Acme Inc.\n"
     ]
    }
   ],
   "source": [
    "customer = rootNode.getElementsByTagName('customer')[0]\n",
    "\n",
    "# 节点名\n",
    "print('tagName：', customer.tagName)\n",
    "print('nodeName：', customer.nodeName)\n",
    "\n",
    "# 节点属性\n",
    "if customer.hasAttribute('ID'):\n",
    "    print(\"ID:\", customer.getAttribute(\"ID\"))\n",
    "    \n",
    "# 获取节点文本数据， 先获取一个叶子节点\n",
    "name_node = customer.getElementsByTagName('name')[0]\n",
    "# 再获取其文本节点\n",
    "text_node = name_node.childNodes[0]\n",
    "# 或者\n",
    "# text_node = name_node.firstChild\n",
    "\n",
    "print('Text data : ', text_node.data)\n",
    "print('Text value: ', text_node.nodeValue)\n",
    "print('Text xml  :', text_node.toxml())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<customer ID=\"C001\">\n",
      "    <name>Acme Inc.</name>\n",
      "    <phone>12345</phone>\n",
      "    <comments>\n",
      "      <![CDATA[Regular customer since 1995]]>\n",
      "    </comments>\n",
      "  </customer>\n"
     ]
    }
   ],
   "source": [
    "# 转字符串\n",
    "print(customer.toxml(encoding=None))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<customer ID=\"C001\">  \n",
      "      <name>Acme Inc.</name>  \n",
      "      <phone>12345</phone>  \n",
      "      <comments>    \n",
      "      <![CDATA[Regular customer since 1995]]>    \n",
      "      </comments>  \n",
      "  </customer>\n"
     ]
    }
   ],
   "source": [
    "# 显示缩进\n",
    "print(customer.toprettyxml(indent='  ', newl=''))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 写入文件\n",
    "with open('added_customer.xml', 'w') as f:\n",
    "    # 缩进 - 换行 - 编码\n",
    "    dom4.writexml(f, indent='  ', addindent='', newl='', encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 写入 XML 文件\n",
    "\n",
    "在写入时，我觉得可分为两种方式：\n",
    "\n",
    "*   新建一个全新的 XML 文件\n",
    "*   在已有 XML 文件基础上追加一些元素信息\n",
    "\n",
    "至于以上两种情况，其实创建元素结点的方法类似，你必须要做的都是先创建 / 得到一个 DOM 对象，再在 DOM 基础上创建 new 一个新的结点。\n",
    "\n",
    "如果是第一种情况，你可以通过`dom=minidom.Document()`来创建；如果是第二种情况，直接可以通过解析已有 XML 文件来得到 dom 对象，例如`dom = parse(\"./customer.xml\")`\n",
    "\n",
    "在具体创建元素 / 文本结点时，你大致会写出像以下这样的 “四部曲” 代码：\n",
    "\n",
    "*   ①创建一个新元素结点 createElement()\n",
    "*   ②创建一个文本节点 createTextNode()\n",
    "*   ③将文本节点挂载元素结点上\n",
    "*   ④将元素结点挂载到其父元素上。\n",
    "\n",
    "现在，我需要新建一个 customer 节点，信息如下:\n",
    "\n",
    "```xml\n",
    "<customer>\n",
    "    <name ID='003'>蒂法</name>\n",
    "    <phone>32467</phone>\n",
    "    <comments>\n",
    "      <![CDATA[A small but healthy company.]]>\n",
    "    </comments>\n",
    "</customer>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<customer>\n",
      "    <name ID=\"003\">蒂法</name>\n",
      "    <phone>32467</phone>\n",
      "    <comments>\n",
      "<![CDATA[A small but healthy company.]]>    </comments>\n",
      "</customer>\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from xml.dom.minidom import getDOMImplementation\n",
    "\n",
    "impl = getDOMImplementation()\n",
    "\n",
    "newdoc = impl.createDocument(None, \"customer\", None)\n",
    "top_element = newdoc.documentElement\n",
    "\n",
    "split_text_node = newdoc.createTextNode('\\n  ')\n",
    "\n",
    "name_node = newdoc.createElement('name')\n",
    "name_node.setAttribute('ID', '003')\n",
    "name_text = newdoc.createTextNode('蒂法')\n",
    "name_node.appendChild(name_text)\n",
    "\n",
    "phone_node = newdoc.createElement('phone')\n",
    "phone_text = newdoc.createTextNode('32467')\n",
    "phone_node.appendChild(phone_text)\n",
    "\n",
    "comments = newdoc.createElement('comments')\n",
    "comments_text = newdoc.createCDATASection('aaaa')\n",
    "comments.appendChild(comments_text)\n",
    "\n",
    "top_element.appendChild(name_node)\n",
    "top_element.appendChild(phone_node)\n",
    "top_element.appendChild(comments_node)\n",
    "print(top_element.toprettyxml(indent='    '))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第二种方法代码如下：**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<customers>\n",
      "    \n",
      "  \n",
      "    <customer ID=\"C001\">\n",
      "        \n",
      "    \n",
      "        <name>Acme Inc.</name>\n",
      "        \n",
      "    \n",
      "        <phone>12345</phone>\n",
      "        \n",
      "    \n",
      "        <comments>\n",
      "            \n",
      "      \n",
      "<![CDATA[Regular customer since 1995]]>            \n",
      "    \n",
      "        </comments>\n",
      "        \n",
      "  \n",
      "    </customer>\n",
      "    \n",
      "\n",
      "    <customer ID=\"C003\">\n",
      "        <name>蒂法</name>\n",
      "        <phone>32467</phone>\n",
      "        <comments>\n",
      "<![CDATA[A small but healthy company.]]>        </comments>\n",
      "    </customer>\n",
      "</customers>\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# 已有文本\n",
    "message = '''\n",
    "<customers>\n",
    "  <customer ID=\"C001\">\n",
    "    <name>Acme Inc.</name>\n",
    "    <phone>12345</phone>\n",
    "    <comments>\n",
    "      <![CDATA[Regular customer since 1995]]>\n",
    "    </comments>\n",
    "  </customer>\n",
    "</customers>\n",
    "'''\n",
    "\n",
    "domTree = parseString(message)\n",
    "# 文档根元素\n",
    "rootNode = domTree.documentElement\n",
    "\n",
    "# 新建一个customer节点\n",
    "customer_node = domTree.createElement(\"customer\")\n",
    "customer_node.setAttribute(\"ID\", \"C003\")\n",
    "\n",
    "# 创建name节点,并设置textValue\n",
    "name_node = domTree.createElement(\"name\")\n",
    "name_text_value = domTree.createTextNode(\"蒂法\")\n",
    "name_node.appendChild(name_text_value)  # 把文本节点挂到name_node节点\n",
    "customer_node.appendChild(name_node)\n",
    "\n",
    "# 创建phone节点,并设置textValue\n",
    "phone_node = domTree.createElement(\"phone\")\n",
    "phone_text_value = domTree.createTextNode(\"32467\")\n",
    "phone_node.appendChild(phone_text_value)  # 把文本节点挂到name_node节点\n",
    "customer_node.appendChild(phone_node)\n",
    "\n",
    "# 创建comments节点,这里是CDATA\n",
    "comments_node = domTree.createElement(\"comments\")\n",
    "cdata_text_value = domTree.createCDATASection(\n",
    "    \"A small but healthy company.\")\n",
    "comments_node.appendChild(cdata_text_value)\n",
    "customer_node.appendChild(comments_node)\n",
    "\n",
    "rootNode.appendChild(customer_node)\n",
    "print(rootNode.toprettyxml(indent='    ',newl='\\n'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 更新 XML 文件\n",
    "\n",
    "在更新 XML 时，只需先找到对应的元素结点，然后将其下的文本结点或属性取值更新即可，然后保存到文件，具体我就不多说了，代码中我将思路都注释清楚了，如下：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "domTree = parse(\"./customer.xml\")\n",
    "# 文档根元素\n",
    "rootNode = domTree.documentElement\n",
    "\n",
    "names = rootNode.getElementsByTagName(\"name\")\n",
    "for name in names:\n",
    "    if name.childNodes[0].data == \"Acme Inc.\":\n",
    "        # 获取到name节点的父节点\n",
    "        pn = name.parentNode\n",
    "        # 父节点的phone节点，其实也就是name的兄弟节点\n",
    "        # 可能有sibNode方法，我没试过，大家可以google一下\n",
    "        phone = pn.getElementsByTagName(\"phone\")[0]\n",
    "        # 更新phone的取值\n",
    "        phone.childNodes[0].data = 99999\n",
    "\n",
    "with open('updated_customer.xml', 'w') as f:\n",
    "    # 缩进 - 换行 - 编码\n",
    "    domTree.writexml(f, addindent='  ', encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "参考：\n",
    "\n",
    "- https://docs.python.org/3.8/library/xml.dom.minidom.html\n",
    "- https://blog.csdn.net/qq_37174526/article/details/89489212\n",
    "- https://docs.python.org/3.8/library/xml.html#xml-vulnerabilities\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "189.6px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
